Abstract

We consider a general problem of finding a strategy that minimizes the exponential moment
of a given cost function, with an emphasis on its relation to the more common criterion
of minimizing the expectation (the first moment) of the same cost function. In particular,
our main result is a theorem that gives simple sufficient conditions for a strategy to be
optimum in the exponential moment sense. This theorem may be useful in various situations,
and application examples are given. We also examine the asymptotic regime and investigate
universal asymptotically optimum strategies in light of the aforementioned sufficient conditions,
as well as phenomena of irregularities, or phase transitions, in the behavior of the asymptotic
performance, which can be viewed and understood from a statistical-mechanical perspective.
Finally, we propose a new route for deriving lower bounds on exponential moments of certain
cost functions (like the squared error in estimation problems) on the basis of well-known lower
bounds on their expectations.

Index Terms: loss function, exponential moment, large deviations, phase transitions, universal 
schemes. 



1 Introduction 



Many problems in information theory, communications, statistical signal processing, and related
disciplines can be formalized as the quest for a strategy $s$ that minimizes (or maximizes)
the expectation of a certain cost function, $\ell(X,s)$, where $X$ is a random variable (or a random
vector). Just a few examples of this generic paradigm are the following: (i) Lossless and lossy data
compression, where $X$ symbolizes the data to be compressed, $s$ is the data compression scheme,
and $\ell(X,s)$ is the length of the compressed binary representation, or the distortion (in the lossy
case), or a linear combination of both (see, e.g., [6, Chapters 5 and 10]). (ii) Gambling and portfolio
theory [6, Chapters 6 and 16], where the cost function is the logarithm of the wealth relative. (iii) Lossy
joint source-channel coding, where $X$ collectively symbolizes the randomness of the source and the
channel, $s$ is the encoding-decoding scheme and $\ell(X,s)$ is the distortion in the reconstruction (see,
e.g., [27], [28]). (iv) Bayesian estimation of a random variable based on measurements, where $X$
designates jointly the desired random variable and the measurements, $s$ is the estimation function
and $\ell(X,s)$ is the error function, for example, the squared error. Non-Bayesian estimation
problems can be considered similarly (see, e.g., [23]). (v) Prediction, sequential decision problems
(see, for example, [18]) and stochastic control problems [5], such as the linear quadratic Gaussian
(LQG) problem, as well as general Markov decision processes, are also formalized in terms of
selecting strategies in order to minimize the expectation of a certain loss function.

While the criterion of minimizing the expected value of $\ell(X,s)$ has been by far the most
common one, the exponential moments of $\ell(X,s)$, namely, $E\exp\{\alpha\ell(X,s)\}$ ($\alpha > 0$), have received
much less attention than they probably deserve in this context. There are a few motivations for
examining strategies that minimize exponential moments. First, $E\exp\{\alpha\ell(X,s)\}$, as a function of
$\alpha$, is the moment-generating function of $\ell(X,s)$, and as such, it provides full information
about the entire distribution of this random variable, not just its first-order moment. Thus, in
particular, if we are fortunate enough to find a strategy that uniformly minimizes $E\exp\{\alpha\ell(X,s)\}$
for all $\alpha > 0$ (and there are examples where this is the case), then this is much stronger than
just minimizing the first moment. Secondly, exponential moments are intimately related to
large-deviations rate functions, and so, the minimization of exponential moments may give us an edge on
minimizing probabilities of (undesired) large-deviations events of the form $\Pr\{\ell(X,s) > L_0\}$ (for
some threshold $L_0$), or more precisely, on maximizing the exponential rate of decay of these
probabilities. There are several works along this line, especially in contexts related to buffer overflow
in data compression [11], [12], [13], [15], [19], [53], [26], and exponential moments related to guessing.

It is natural to ask, in view of the foregoing discussion, how we can harness the existing body of
knowledge concerning optimization of strategies for minimizing the first moment of $\ell(X,s)$, which is
quite mature in many applications, in our quest for optimum strategies that minimize exponential
moments. Our main basic result, in this paper, is a simple theorem that relates the two criteria. In
particular, we furnish sufficient conditions under which the optimum strategy in the exponential moment
sense can be found in terms of the optimum strategy in the first moment sense, for a possibly
different probability distribution, which our theorem characterizes.

In some applications, these sufficient conditions for optimality in the exponential moment sense
yield an equation in $s$, whose solution is the desired optimum strategy. It is clear then that in
these applications, the optimality conditions provide a concrete tool for deriving the optimum
solution. In other applications, however, this may not be quite the case directly, yet the set of
optimality conditions may still serve as a useful tool: More often than not, in a given instance
of the problem under discussion, one may have a natural intuitive guess concerning the optimum
strategy, and then the optimality conditions can be used to prove that this is the case. One example
of this, which will be demonstrated in detail later on, is the following: Given $n$ independent and
identically distributed (i.i.d.) Gaussian observations, $X_1,\ldots,X_n$, with mean $\theta$, the sample mean,
$s(X_1,\ldots,X_n) = \frac{1}{n}\sum_{i=1}^n X_i$, is the optimum unbiased estimator of $\theta$, not merely in the mean
squared error sense (as is well known), but also in the sense of minimizing all exponential moments
of the squared error, i.e., $E\exp\{\alpha[s(X_1,\ldots,X_n) - \theta]^2\}$ for all $\alpha > 0$ for which this expectation is
finite.

We next devote some attention to the asymptotic regime. Consider the case where $X$ is a random
vector of dimension $n$, $X = (X_1,\ldots,X_n)$, governed by a product-form probability distribution, and
$\ell(X,s)$ grows linearly for a given empirical distribution of $X$, for example, when $\ell(X,s)$ is additive,
i.e., $\ell(X,s) = \sum_{i=1}^n l(X_i,s)$. In this case, the exponential moments of $\ell(X,s)$ typically behave (at
least asymptotically) like exponential functions of $n$. If we can then select a strategy $s$ that somehow
"adapts"$^1$ to the empirical distribution of $(X_1,\ldots,X_n)$, then such strategies may be universally
optimum (or asymptotically optimum in the sense of achieving the minimum exponential rate of
the exponential moment) in that they depend neither on the underlying probability distribution,
nor on the parameter $\alpha$. This is demonstrated in several examples, one of which is an extension of
a well-known result by Rissanen in universal data compression [22].

An interesting byproduct of the use of the exponential moment criterion in the asymptotic
regime is the possible existence of phase transitions: It turns out that the asymptotic exponential
rate of $E\exp\{\alpha\ell(X_1,\ldots,X_n,s)\}$ as a function of $n$ may not be a smooth function of $\alpha$ and/or the
parameters of the underlying probability distribution, even when the model under discussion seems
rather simple and 'innocent.' This is best understood from a statistical-mechanical perspective,
because in some cases, the calculation of the exponential moment is clearly analogous to that of the
partition function of a certain physical system of interacting particles, which is known to exhibit
phase transitions. It is demonstrated that at least in certain cases, these phase transitions are not
merely an artifact of a badly chosen strategy, but they appear even when the optimum strategy is
used, and hence these phase transitions are inherent in the model.

1 The precise meaning of this will be clarified in the sequel.

We end this paper by touching upon yet another aspect of the exponential moment criterion,
which we do not investigate very thoroughly here, but which we believe is interesting and therefore
deserves further study in the future: Even in the ordinary setting of seeking strategies that
minimize $E\{\ell(X,s)\}$, optimum strategies may not always be known, and then lower bounds are of
considerable importance as a reference performance figure. This is a fortiori the case when
exponential moments are considered. One way to obtain non-trivial bounds on exponential moments
is via lower bounds on the expectation of $\ell(X,s)$, using the techniques developed in this paper. We
demonstrate this idea in the context of a lower bound on the expected exponentiated squared error
of an unbiased parameter estimator, on the basis of the Cramér-Rao bound (CRB), but it should
be understood that, more generally, the same idea can be applied on the basis of other well-known
bounds on the mean-square error (Bayesian and non-Bayesian) in parameter estimation, and in
signal estimation, as well as in other problem areas.

2 Basic Optimality Conditions 

Let $X$ be a random variable taking on values in a certain alphabet $\mathcal{X}$, and drawn according to a
given probability distribution $P$. Let the variable $s$ designate a strategy chosen from some space
$\mathcal{S}$ of allowed strategies. The term "strategy" in our context is fairly generic: it may be a scalar
variable, a vector, an infinite sequence, a function (of $X$), a partition of $\mathcal{X}$, a coding scheme for
$X$, and so on. Associated with each $x \in \mathcal{X}$ and $s \in \mathcal{S}$ is a loss $\ell(x,s)$. The function $\ell(x,s)$
is called the loss function, or the cost function. The operator $E\{\cdot\}$ will be understood as the
expectation operator with respect to (w.r.t.) the underlying distribution $P$, and whenever we
refer to the expectation w.r.t. another probability distribution, say, $Q$, we use the notation $E_Q\{\cdot\}$.
Nonetheless, occasionally, when there is more than one probability distribution playing a role at
the same time and we wish to emphasize that the expectation is taken w.r.t. $P$, then to avoid
confusion, we may denote this expectation by $E_P\{\cdot\}$.

For a given $\alpha > 0$, consider the problem of minimizing $E\exp\{\alpha\ell(X,s)\}$ across $s \in \mathcal{S}$. The
following theorem relates the optimum $s$ for this problem to the optimum $s$ for the problem of
minimizing $E_Q\{\ell(X,s)\}$ w.r.t. another probability distribution $Q$.

Theorem 1 Assume that there exists a strategy $s \in \mathcal{S}$ for which

$$Z(s) = E_P\exp\{\alpha\ell(X,s)\} < \infty. \tag{1}$$

A strategy $s \in \mathcal{S}$ minimizes $E_P\exp\{\alpha\ell(X,s)\}$ if there exists a probability distribution $Q$ on $\mathcal{X}$ that
satisfies the following two conditions at the same time:

1. The strategy $s$ minimizes $E_Q\{\ell(X,s)\}$ over $\mathcal{S}$.

2. The probability distribution $Q$ is given by

$$Q(x) = \frac{P(x)e^{\alpha\ell(x,s)}}{Z(s)}. \tag{2}$$

An equivalent formulation of Theorem 1 is the following: denoting by $s_Q$ a strategy that minimizes
$E_Q\{\ell(X,s)\}$ over $\mathcal{S}$, the theorem asserts that $s_Q$ minimizes $E_P\exp\{\alpha\ell(X,s)\}$ over $\mathcal{S}$ if

$$Q(x) \propto P(x)e^{\alpha\ell(x,s_Q)}, \tag{3}$$

where by $A(x) \propto B(x)$, we mean that $A(x)/B(x)$ is a constant, independent of $x$.

Proof. Let $s \in \mathcal{S}$ be arbitrary and let $(s^*,Q^*)$ satisfy conditions 1 and 2 of Theorem 1. Consider
the following chain of inequalities:

$$\begin{aligned}
E_P\exp\{\alpha\ell(X,s)\} &= E_{Q^*}\exp\left\{\alpha\ell(X,s) + \ln\frac{P(X)}{Q^*(X)}\right\} \\
&\ge \exp\{\alpha E_{Q^*}\ell(X,s) - D(Q^*\|P)\} \\
&\ge \exp\{\alpha E_{Q^*}\ell(X,s^*) - D(Q^*\|P)\} \\
&= \exp\left\{\alpha E_{Q^*}\ell(X,s^*) - E_{Q^*}\ln\frac{e^{\alpha\ell(X,s^*)}}{Z(s^*)}\right\} \\
&= Z(s^*) = E_P\exp\{\alpha\ell(X,s^*)\},
\end{aligned} \tag{4}$$

where the first equality results from a change of measure (multiplying and dividing $e^{\alpha\ell(X,s)}$ by
$Q^*(X)$), the second line is by Jensen's inequality and the convexity of the exponential
function (with $D(Q\|P) = E_Q\ln[Q(X)/P(X)]$ being the relative entropy between $Q$ and $P$), the
third line is by condition 1 of Theorem 1, and the remaining equalities result from condition
2: On substituting $Q^*(x) = P(x)e^{\alpha\ell(x,s^*)}/Z(s^*)$ into $D(Q^*\|P)$, one readily obtains
$D(Q^*\|P) = \alpha E_{Q^*}\ell(X,s^*) - \ln Z(s^*)$. This completes the proof of Theorem 1. □

Observe that for a given $s$, Jensen's inequality in the second line of (4) becomes an equality
for $Q(x) = P(x)e^{\alpha\ell(x,s)}/Z(s)$, since for this choice of $Q$, the random variable that appears in the
exponent, $\alpha\ell(X,s) + \ln[P(X)/Q(X)]$, becomes degenerate (constant with probability one). Since the original
expression is independent of $Q$, such an equality in Jensen's inequality means that
$\alpha E_Q\ell(X,s) - D(Q\|P)$ is maximized by this choice of $Q$, a fact which can also be seen from a direct maximization
of this expression using standard methods. Thus, we have a simple identity for every $s$:

$$E_P\exp\{\alpha\ell(X,s)\} = \exp\left\{\max_Q\,[\alpha E_Q\ell(X,s) - D(Q\|P)]\right\}. \tag{5}$$

This identity will prove useful in several places throughout the sequel.
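As a quick numerical sanity check of identity (5), the following minimal sketch evaluates both sides on a small finite alphabet. The numbers `P`, `loss` and `alpha` are hypothetical toy values chosen only for illustration; they are not taken from the paper.

```python
import math
import random

def log_exp_moment(P, loss, alpha):
    """ln E_P exp{alpha * loss(X)} on a finite alphabet."""
    return math.log(sum(p * math.exp(alpha * l) for p, l in zip(P, loss)))

def tilted_distribution(P, loss, alpha):
    """Q(x) = P(x) exp{alpha * loss(x)} / Z, the maximizer in identity (5)."""
    weights = [p * math.exp(alpha * l) for p, l in zip(P, loss)]
    z = sum(weights)
    return [w / z for w in weights]

def variational_objective(Q, P, loss, alpha):
    """alpha * E_Q[loss] - D(Q||P), with the convention 0 ln 0 = 0."""
    mean_loss = sum(q * l for q, l in zip(Q, loss))
    divergence = sum(q * math.log(q / p) for q, p in zip(Q, P) if q > 0)
    return alpha * mean_loss - divergence

# Hypothetical toy values (assumptions for illustration only).
P = [0.5, 0.3, 0.2]
loss = [1.0, 2.5, 0.3]   # loss(x, s) for one fixed strategy s
alpha = 0.7

lhs = log_exp_moment(P, loss, alpha)
rhs = variational_objective(tilted_distribution(P, loss, alpha), P, loss, alpha)
print(lhs, rhs)  # the two values should agree up to rounding

# Any other distribution Q should give a value no larger than lhs.
rng = random.Random(0)
for _ in range(5):
    raw = [rng.random() for _ in P]
    total = sum(raw)
    Q = [r / total for r in raw]
    print(variational_objective(Q, P, loss, alpha) <= lhs + 1e-12)
```

The exponentially tilted distribution attains the maximum, while randomly drawn distributions fall below it, which is the content of the identity.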
Suppose next that the set $\mathcal{S}$ and the loss function $\ell(x,s)$ are such that

$$\min_{s\in\mathcal{S}}\max_Q\,[\alpha E_Q\ell(X,s) - D(Q\|P)] = \max_Q\min_{s\in\mathcal{S}}\,[\alpha E_Q\ell(X,s) - D(Q\|P)]. \tag{6}$$

This equality between the min-max and the max-min means that there is a saddle point $(s^*,Q^*)$,
where $s^*$ is a solution of the min-max problem on the left-hand side and $Q^*$ is a solution to the
max-min problem on the right-hand side. It is easy to check that the maximizing $Q$ in the inner
maximization on the left-hand side is $Q^*(x) = P(x)e^{\alpha\ell(x,s^*)}/Z(s^*)$, which is condition 2 of Theorem
1. By the same token, the inner minimization over $s$ on the right-hand side obviously minimizes
$E_{Q^*}\ell(X,s)$, which is condition 1. It follows then that if the min-max and the max-min are equal,
then the saddle point satisfies the conditions of Theorem 1, and hence the corresponding $s^*$ is
optimum. Note also that when eq. (6) holds, the conditions of Theorem 1 also become necessary
conditions for optimality: Suppose that $s^*$ is optimum. Then, by eq. (5), it must solve the minimax
problem on the left-hand side of eq. (6). But if eq. (6) holds then there is a saddle point, and $s^*$
is the first coordinate of this saddle point, $(s^*,Q^*)$. But then $s^*$ and $Q^*$ must be related according
to the conditions of Theorem 1, as explained above.

When does eq. (6) hold? In general, the well-known sufficient conditions for

$$\min_{u\in U}\max_{v\in V} f(u,v) = \max_{v\in V}\min_{u\in U} f(u,v) \tag{7}$$

are that $U$ and $V$ are convex sets (with $U$ being independent of $v$ and $V$ being independent of $u$),
and that $f$ is convex in $u$ and concave in $v$. In our case, since the function $f(s,Q) = \alpha E_Q\ell(X,s) - D(Q\|P)$
is always concave in $Q$, this sufficient condition would automatically hold whenever $\ell(x,s)$
is convex in $s$ (for every fixed $x$), provided that $\mathcal{S}$ is a space in which convex combinations can be
well defined, and that $\mathcal{S}$ is a convex set.

Maximizing Negative Exponential Moments. A similar, but somewhat different, criterion
pertaining to exponential moments, which is reasonable to the same extent, is the dual problem
of $\max_{s\in\mathcal{S}} E\exp\{-\alpha\ell(X,s)\}$ (again, with $\alpha > 0$). If $\ell(x,s)$ is non-negative for all $x$ and $s$, this has
the advantage that the exponential moment is finite for all $\alpha > 0$, as opposed to $E\exp\{\alpha\ell(X,s)\}$
which, in many cases, is finite only for a limited range of $\alpha$. For the same considerations as before,
here we have:

$$\max_{s\in\mathcal{S}} E\exp\{-\alpha\ell(X,s)\} = \max_{s\in\mathcal{S}}\exp\left\{\max_Q\,[-\alpha E_Q\ell(X,s) - D(Q\|P)]\right\} = \exp\left\{-\min_{s\in\mathcal{S}}\min_Q\,[\alpha E_Q\ell(X,s) + D(Q\|P)]\right\}, \tag{8}$$

and so the optimality conditions relating $s$ and $Q$ are similar to those of Theorem 1 (with $\alpha$
replaced by $-\alpha$), except that now we have a double minimization problem rather than a min-max
problem. However, it should be noted that here the conditions of Theorem 1 are only necessary
conditions, as for the above equalities to hold, the pair $(s,Q)$ should globally minimize the function
$[\alpha E_Q\ell(X,s) + D(Q\|P)]$, unlike the earlier case, where only a saddle point was sought.$^2$ On the
other hand, another advantage of this criterion is that even if one cannot solve explicitly the
equation for the optimum $s$, the double minimization naturally suggests an iterative algorithm:
starting from an initial guess $s_0 \in \mathcal{S}$, one computes $Q_0(x) \propto P(x)\exp\{-\alpha\ell(x,s_0)\}$ (which minimizes
$[\alpha E_Q\ell(X,s_0) + D(Q\|P)]$ over $\{Q\}$), then one finds $s_1 = \arg\min_{s\in\mathcal{S}} E_{Q_0}\{\ell(X,s)\}$, and so on. It is
obvious that $E\exp\{-\alpha\ell(X,s_i)\}$, $i = 0,1,2,\ldots$, is non-decreasing (and hence can only improve) from iteration to
iteration. This is different from the min-max situation we encountered earlier, where successive
improvements are not guaranteed.

2 In other words, it is not enough now that $s$ and $Q$ are in 'equilibrium' in the sense that $s$ is a minimizer for a
given $Q$ and vice versa.
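The alternating procedure for maximizing the negative exponential moment can be sketched as follows. This is a minimal illustration under simplifying assumptions that are not in the text: the alphabet and the strategy set are both finite, the loss is stored as a table `losses[s][x]`, and all numerical values are hypothetical. The function names are likewise illustrative.

```python
import math

def neg_exp_moment(P, loss_row, alpha):
    """E_P exp{-alpha * loss(X, s)} for one strategy (one row of the table)."""
    return sum(p * math.exp(-alpha * l) for p, l in zip(P, loss_row))

def tilted(P, loss_row, alpha):
    """Q(x) proportional to P(x) exp{-alpha * loss(x, s)}."""
    weights = [p * math.exp(-alpha * l) for p, l in zip(P, loss_row)]
    z = sum(weights)
    return [w / z for w in weights]

def alternate(P, losses, alpha, s0=0, max_iters=100):
    """Alternate between the tilted Q and the minimizer of E_Q[loss]."""
    s = s0
    history = [neg_exp_moment(P, losses[s], alpha)]
    for _ in range(max_iters):
        Q = tilted(P, losses[s], alpha)
        costs = [sum(q * l for q, l in zip(Q, row)) for row in losses]
        best = min(range(len(losses)), key=costs.__getitem__)
        if costs[best] >= costs[s]:
            break  # s already minimizes E_Q[loss]: a fixed point was reached
        s = best
        history.append(neg_exp_moment(P, losses[s], alpha))
    return s, history

# Hypothetical toy problem: 3 letters, 4 candidate strategies.
P = [0.5, 0.3, 0.2]
losses = [
    [4.0, 0.0, 0.0],
    [1.0, 1.0, 1.0],
    [0.0, 3.0, 5.0],
    [0.5, 2.0, 0.5],
]
alpha = 1.0

for start in range(len(losses)):
    s_final, history = alternate(P, losses, alpha, s0=start)
    print(start, s_final, history)
```

Each recorded value in `history` is no smaller than the previous one, in line with the monotonicity noted above. Different starting points may stop at different fixed points, which reflects the remark that the matching conditions are only necessary in this setting.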
3 A Few Examples



Theorem 1 tells us that if we are fortunate enough to find a strategy $s \in \mathcal{S}$ and a probability
distribution $Q$, which are 'matched' to one another (in the sense defined by the above conditions),
then we have solved the problem of minimizing the exponential moment. Sometimes it is fairly easy
to find such a pair $(s,Q)$ by solving an equation. In other cases, there might be a natural guess for
the optimum $s$, which can be proven optimum by checking the conditions. In this section, we will
see examples of both types. Some of these examples could also have been solved directly, without
using Theorem 1, but for others, this does not seem to be a trivial task. In some of the examples,
it turns out that the same strategy that minimizes the expected loss is also optimum in the
sense of minimizing all exponential moments, but this is, of course, not always the case.

3.1 Example 1: Lossless Data Compression 

We begin with a very simple example. Let $X$ be a random variable taking on values in a finite
alphabet $\mathcal{X}$, let $s$ be a probability distribution on $\mathcal{X}$, i.e., a vector $\{s(x),\ x \in \mathcal{X}\}$ with $\sum_{x\in\mathcal{X}} s(x) = 1$
and $s(x) \ge 0$ for all $x \in \mathcal{X}$, and let $\ell(x,s) = -\ln s(x)$. This example is clearly motivated by lossless
data compression, as $-\ln s(x)$ is the length function (in nats) pertaining to a uniquely decodable
code that is induced by a distribution $s$, ignoring integer length constraints. In this problem, one
readily observes that the optimum $s$ for minimizing $E_Q\{-\ln s(X)\}$ is $s_Q = Q$. Thus, by eq. (3),
we seek a distribution $Q$ such that

$$Q(x) \propto P(x)\exp\{-\alpha\ln Q(x)\} = \frac{P(x)}{[Q(x)]^{\alpha}}, \tag{9}$$

which means $[Q(x)]^{1+\alpha} \propto P(x)$, or equivalently, $Q(x) \propto [P(x)]^{1/(1+\alpha)}$. More precisely,

$$Q(x) = \frac{[P(x)]^{1/(1+\alpha)}}{\sum_{x'\in\mathcal{X}}[P(x')]^{1/(1+\alpha)}}, \tag{10}$$

and the exponential moment of $-\ln s_Q(X)$ yields the Rényi entropy: $\ln E_P\exp\{-\alpha\ln s_Q(X)\} = \alpha H_{1/(1+\alpha)}(P)$,
where $H_\beta(P) = \frac{1}{1-\beta}\ln\sum_x [P(x)]^\beta$. Note that here $\ell(x,s)$ is convex in
$s$ and so, the minimax condition holds. While this result is well known and it could have been
obtained even without using Theorem 1, our purpose in this example was to show how Theorem 1
gives the desired solution even more easily than the direct method, by solving a very simple
equation.
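The solution of this example can be checked numerically. The sketch below, with a hypothetical three-letter source and a hypothetical value of $\alpha$, builds the distribution $Q(x) \propto [P(x)]^{1/(1+\alpha)}$, verifies that it reproduces itself under the matching condition $Q(x) \propto P(x)[Q(x)]^{-\alpha}$, and compares the resulting exponential moment of the code length with the Rényi entropy of order $1/(1+\alpha)$. Function names are illustrative only.

```python
import math

def optimal_code_distribution(P, alpha):
    """Q(x) proportional to P(x)**(1/(1+alpha))."""
    weights = [p ** (1.0 / (1.0 + alpha)) for p in P]
    z = sum(weights)
    return [w / z for w in weights]

def exp_moment_of_length(P, s, alpha):
    """E_P exp{alpha * (-ln s(X))} = sum_x P(x) * s(x)**(-alpha)."""
    return sum(p * sx ** (-alpha) for p, sx in zip(P, s))

def renyi_entropy(P, order):
    """Renyi entropy in nats, for order != 1."""
    return math.log(sum(p ** order for p in P)) / (1.0 - order)

# Hypothetical toy source and parameter (assumptions for illustration).
P = [0.6, 0.3, 0.1]
alpha = 1.5
Q = optimal_code_distribution(P, alpha)

# Matching condition: normalizing P(x) * Q(x)**(-alpha) should return Q itself.
raw = [p * q ** (-alpha) for p, q in zip(P, Q)]
raw_sum = sum(raw)
refit = [r / raw_sum for r in raw]
print(Q, refit)

# Log exponential moment versus alpha times the Renyi entropy.
print(math.log(exp_moment_of_length(P, Q, alpha)),
      alpha * renyi_entropy(P, 1.0 / (1.0 + alpha)))

# Coding with s = P (optimal for the expected length) is worse in this sense.
print(exp_moment_of_length(P, P, alpha) >= exp_moment_of_length(P, Q, alpha))
```

The last comparison illustrates that the strategy minimizing the expected length is, in general, not the one minimizing the exponential moment.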
3.2 Example 2: Bayesian Estimation



Let $(X,Y)$ be random variables, where $Y$ is distributed according to a given density $P(y)$ and the
conditional density of $X$ given $Y$ is given by

$$P(x|y) = \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}(x-\phi(y))^2\right\}, \tag{11}$$

where $\phi(y)$ is a given function. We are seeking the optimum linear estimator of $X$ based on the
observation $Y$ in the sense of minimizing the exponential moment of the squared error. In other words,
we seek a real number $s$ that minimizes $E\exp\{\alpha(X-sY)^2\}$, where $\alpha \in (0,1/2)$. Once again, the
loss function is convex in $s$. According to the second condition of Theorem 1, $Q$ should be of the
form

$$\begin{aligned}
Q(x,y) &\propto P(y)\cdot\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}(x-\phi(y))^2 + \alpha(x-sy)^2\right\} \\
&\propto P(y)\exp\left\{\frac{\alpha}{1-2\alpha}[\phi(y)-sy]^2\right\}\cdot Q(x|y) \\
&\propto \tilde{P}(y)\cdot Q(x|y),
\end{aligned} \tag{12}$$

where $Q(x|y)$ is a Gaussian density with mean $[\phi(y)-2\alpha sy]/(1-2\alpha)$ and variance $1/(1-2\alpha)$,
and

$$\tilde{P}(y) \propto P(y)\exp\left\{\frac{\alpha}{1-2\alpha}[\phi(y)-sy]^2\right\}. \tag{13}$$

On the other hand, by the first condition of Theorem 1, $s$ should be the coefficient pertaining to
the optimum linear estimator of $X$ based on $Y$ under $Q$, which is

$$s = \frac{E_Q(XY)}{E_Q(Y^2)} = \frac{E_Q\{Y\cdot E_Q(X|Y)\}}{E_Q(Y^2)}. \tag{14}$$

But since $Q(x|y)$ is Gaussian with mean $[\phi(y)-2\alpha sy]/(1-2\alpha)$, as said, this mean is exactly the
inner expectation in the numerator, and so, we obtain

$$s = \frac{1}{1-2\alpha}\left[\frac{E_Q\{Y\phi(Y)\}}{E_Q(Y^2)} - 2\alpha s\right], \tag{15}$$

or equivalently,

$$s = \frac{E_Q\{Y\phi(Y)\}}{E_Q(Y^2)}. \tag{16}$$

But since these expectations involve only the random variable $Y$, whose marginal under $Q$ is $\tilde{P}$,
the expectations are actually taken under $\tilde{P}$, i.e.,

$$s = \frac{E_{\tilde{P}}\{Y\phi(Y)\}}{E_{\tilde{P}}(Y^2)}. \tag{17}$$

Note that this is different from the solution to the ordinary MMSE problem, where the solution is
given by the same expression, but with $\tilde{P}$ being replaced by $P$. It should be kept in mind that $\tilde{P}$
depends, in general, on $s$, and hence so does the right-hand side of the last equation. We have therefore
obtained an equation whose solution $s = s^*$ is the optimum coefficient in the sense of minimum
$E_P\exp\{\alpha(X-sY)^2\}$. Let us now examine a few simple special cases.

Consider first the case $\phi(y) = s_0 y$, for some real constant $s_0$. In this case, the right-hand side
of eq. (17) is trivially equal to $s_0$, which means that the solution of eq. (17) is $s = s_0$. This means that whenever $(X,Y)$
is a Gaussian vector, the linear MMSE estimator also minimizes all exponential moments of the
squared error (among all linear estimators).

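Consider next the case where

$$P(y) = \frac{1}{2}\delta(y-1) + \frac{1}{2}\delta(y+1), \tag{18}$$

and denote $\phi_+ = \phi(+1)$ and $\phi_- = \phi(-1)$. Then, with $\tilde{P}$ as in eq. (13),

$$\tilde{P}(y) = \frac{\exp\left\{\frac{\alpha}{1-2\alpha}(\phi_+ - s)^2\right\}\delta(y-1) + \exp\left\{\frac{\alpha}{1-2\alpha}(\phi_- + s)^2\right\}\delta(y+1)}{\exp\left\{\frac{\alpha}{1-2\alpha}(\phi_+ - s)^2\right\} + \exp\left\{\frac{\alpha}{1-2\alpha}(\phi_- + s)^2\right\}}. \tag{19}$$

Since $E_{\tilde{P}}(Y^2) = 1$, the equation in $s$ reads

$$s = \frac{\phi_+\exp\left\{\frac{\alpha}{1-2\alpha}(\phi_+ - s)^2\right\} - \phi_-\exp\left\{\frac{\alpha}{1-2\alpha}(\phi_- + s)^2\right\}}{\exp\left\{\frac{\alpha}{1-2\alpha}(\phi_+ - s)^2\right\} + \exp\left\{\frac{\alpha}{1-2\alpha}(\phi_- + s)^2\right\}}. \tag{20}$$

For $\phi_+ = -\phi_-$, we get $s = \phi_+$, which is expected since this is actually the linear case discussed
above, with $s_0 = \phi_+$. For $\phi_+ = \phi_- = \phi$, the equation reads

$$s = -\phi\tanh\left(\frac{2\alpha\phi s}{1-2\alpha}\right) \tag{21}$$

and the only solution is $s = 0$, which makes sense since $X$ and $Y$ are independent in this case. For
$\alpha \to 0$, the ordinary MMSE linear estimator is recovered, whose coefficient is $s = (\phi_+ - \phi_-)/2$, as
expected.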
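For the binary-$Y$ case, the equation in $s$ generally has no closed-form solution, but it is easy to solve numerically. The following sketch is an illustration under stated assumptions, not a method from the paper: it uses bisection on $g(s) = s - \mathrm{RHS}(s)$, which has the same sign as the derivative of the (strictly convex) objective, and it uses the standard Gaussian closed form $E\exp\{\alpha(X-c)^2\} = (1-2\alpha)^{-1/2}\exp\{\alpha(\mu-c)^2/(1-2\alpha)\}$ for $X \sim N(\mu,1)$. The function names and test values are hypothetical.

```python
import math

def tilted_rhs(s, phi_plus, phi_minus, alpha):
    """Right-hand side of the fixed-point equation for the binary-Y case."""
    a = alpha / (1.0 - 2.0 * alpha)
    e_plus = math.exp(a * (phi_plus - s) ** 2)
    e_minus = math.exp(a * (phi_minus + s) ** 2)
    return (phi_plus * e_plus - phi_minus * e_minus) / (e_plus + e_minus)

def solve_coefficient(phi_plus, phi_minus, alpha, iters=200):
    """Bisection on g(s) = s - rhs(s); the root is unique by convexity."""
    if not 0.0 < alpha < 0.5:
        raise ValueError("alpha must lie in (0, 1/2)")
    # rhs is a convex combination of phi_plus and -phi_minus, so the
    # root lies strictly inside [-bound, bound].
    bound = abs(phi_plus) + abs(phi_minus) + 1.0
    lo, hi = -bound, bound
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid - tilted_rhs(mid, phi_plus, phi_minus, alpha) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def exp_moment(s, phi_plus, phi_minus, alpha):
    """E exp{alpha (X - sY)^2} for Y = +1/-1 equiprobable, X|Y ~ N(phi(Y), 1)."""
    a = alpha / (1.0 - 2.0 * alpha)
    inner = 0.5 * (math.exp(a * (phi_plus - s) ** 2)
                   + math.exp(a * (phi_minus + s) ** 2))
    return inner / math.sqrt(1.0 - 2.0 * alpha)

# Hypothetical values for illustration.
phi_plus, phi_minus, alpha = 1.5, -0.5, 0.2
s_star = solve_coefficient(phi_plus, phi_minus, alpha)
print(s_star, exp_moment(s_star, phi_plus, phi_minus, alpha))
print((phi_plus - phi_minus) / 2.0)  # ordinary linear MMSE coefficient, for comparison
```

Comparing `s_star` with the ordinary MMSE coefficient shows how the exponential criterion shifts the solution when $\phi_+ \neq -\phi_-$.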

Returning to the general setting of this example, let us examine what would happen if we expand
the scope and allow a general, non-linear estimator. In this case, we seek a general function $s(y)$
such that

$$Q(x,y) \propto P(y)\exp\left\{-\frac{1}{2}(x-\phi(y))^2\right\}\cdot\exp\{\alpha(x-s(y))^2\}. \tag{22}$$

If $\alpha \in (0,1/2)$ and we guess $s(y) = \phi(y)$, we obtain

$$Q(x,y) \propto P(y)\exp\left\{-\left(\frac{1}{2}-\alpha\right)(x-\phi(y))^2\right\}, \tag{23}$$

for which the conditional mean, $s_Q(y) = E_Q(X|Y=y)$, is indeed $\phi(y)$, and so, the conditions of
Theorem 1 are satisfied. It follows then that for our example, $s(y) = E(X|Y=y) = \phi(y)$ minimizes
not only the MSE, but also all exponential moments of the squared error, $E\exp\{\alpha(X-s(Y))^2\}$,
for $0 < \alpha < 1/2$.

The same idea applies to somewhat more general situations. Let $\rho(t)$ be an even function, which
is monotonically non-decreasing for $t \ge 0$, and steeply enough so that $\int_{-\infty}^{\infty} dt\, e^{-\beta\rho(t)} < \infty$ for all
$\beta > \beta_0$, where $\beta_0 \ge 0$ is a certain constant. Suppose that $P(x,y) \propto P(y)\exp[-\beta\rho(x-\phi(y))]$ for
some $\beta > \beta_0$, and we are interested in minimizing the exponential moment $E\exp\{\alpha\rho(X-s(Y))\}$.
Then, for every $\alpha \in (0,\beta-\beta_0)$, the choice $s(y) = \phi(y)$ leaves $Q(x|y)$ symmetric about $x = \phi(y)$.
If $r = 0$ minimizes $\int_{-\infty}^{\infty} dt\,\rho(t-r)e^{-\beta\rho(t)}$ for every $\beta > \beta_0$ (which is true in many cases), then the
estimator $s(y) = \phi(y)$ minimizes all exponential moments of $\rho(X-s(Y))$. This can be even further
generalized to cases where $P(x|y) \propto \exp[-\beta\rho_1(x-\phi(y))]$ for a given symmetric function $\rho_1$ that
may be different from the function $\rho$ for which we wish to minimize the exponential moment.

The above considerations extend also to signal estimation (prediction, filtering, etc.): Consider
two jointly wide-sense stationary Gaussian processes, $\{(X_n,Y_n)\}$. Given $(\ldots,Y_{-1},Y_0,Y_1,\ldots)$, each
$X_t$ is Gaussian, with conditional mean given by $E\{X_t|\ldots,Y_{-1},Y_0,Y_1,\ldots\} = \sum_{k=-\infty}^{\infty} h_k Y_{t-k}$, $\{h_k\}$
being the impulse response of the non-causal Wiener filter. For the same reasons as before, the
exponential moments of the squared error are also minimized by the non-causal Wiener filter. It
is not clear, however, whether the causal Wiener filter minimizes the exponentiated squared error
among all causal filters, unless the non-causal Wiener filter happens to coincide with the causal
one. Optimum linear predictors of Gaussian processes in the ordinary mean square error sense are
also optimum in the mean exponentiated squared error sense.



3.3 Example 3: Non-Bayesian Estimation 



Let $X_1, X_2, \ldots, X_n$ be i.i.d. Gaussian random variables with mean $\theta$ and variance $\sigma^2$. It is very well
known that among all unbiased estimators of $\theta$, the one that minimizes the mean square error (or
equivalently, the estimation error variance) is the sample mean $s(x_1,\ldots,x_n) = \frac{1}{n}\sum_{i=1}^n x_i$. Does the
sample mean estimator also minimize $E\exp\{\alpha[s(X_1,\ldots,X_n)-\theta]^2\}$ among all unbiased estimators
and for all values of $\alpha$ in the allowed range?

Once again, the class $\mathcal{S}$ of all unbiased estimators is clearly a convex set and $(s-\theta)^2$ is convex
in $s$. Let us 'guess' that the sample mean indeed also minimizes $E\exp\{\alpha[s(X_1,\ldots,X_n)-\theta]^2\}$ and
then check whether it satisfies the conditions of Theorem 1. The corresponding probability measure
$Q$, which will be denoted here by $Q_\theta$, is given by



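$$\begin{aligned}
Q_\theta(x_1,\ldots,x_n) &\propto \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\theta)^2 + \alpha\left[\frac{1}{n}\sum_{i=1}^n x_i - \theta\right]^2\right\} \\
&= \exp\left\{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i-\theta)^2 + \alpha\left[\frac{1}{n}\sum_{i=1}^n (x_i-\theta)\right]^2\right\} \\
&= \exp\left\{-\frac{(x-\theta u)^T W (x-\theta u)}{2\sigma^2}\right\},
\end{aligned} \tag{24}$$

where $x = (x_1,\ldots,x_n)^T$, $u = (1,1,\ldots,1)^T \in \mathbb{R}^n$ and

$$W = I - \frac{2\alpha\sigma^2}{n^2}\,uu^T, \tag{25}$$

$I$ being the $n \times n$ identity matrix. The maximum likelihood estimator of $\theta$ under $Q_\theta$ is given by

$$s(x) = \frac{u^T W x}{u^T W u} = \frac{1}{n}\sum_{i=1}^n x_i, \tag{26}$$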



namely, the sample mean. It can easily be shown to achieve the Cramér-Rao lower bound under
$Q_\theta$, which is $\sigma^2/(u^T W u)$. Thus, the sample mean estimator is an optimum unbiased estimator for
$Q_\theta$ and hence it satisfies the conditions of Theorem 1. The answer to the question of the previous
paragraph is then affirmative. The best achievable performance, in the exponential moment sense,
is given by

$$E\exp\left\{\alpha\left[\frac{1}{n}\sum_{i=1}^n X_i - \theta\right]^2\right\} = \frac{1}{\sqrt{1-2\alpha\sigma^2/n}}, \tag{27}$$

which is finite for $\alpha < n/(2\sigma^2)$.
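The two computational claims of this example can be checked with a short script. The sketch below is illustrative only: it verifies on hypothetical data that $u^TWx/(u^TWu)$ coincides with the sample mean for $W = I - (2\alpha\sigma^2/n^2)uu^T$, and it compares the closed form $(1-2\alpha\sigma^2/n)^{-1/2}$ with a plain Riemann-sum evaluation of $E\exp\{\alpha Z^2\}$ for $Z \sim N(0,\sigma^2/n)$. The function names, step count, and integration width are assumptions.

```python
import math

def closed_form_moment(alpha, sigma2, n):
    """(1 - 2*alpha*sigma2/n)**(-1/2), valid for alpha < n / (2*sigma2)."""
    c = 2.0 * alpha * sigma2 / n
    if c >= 1.0:
        raise ValueError("expectation is infinite: need alpha < n / (2 * sigma2)")
    return 1.0 / math.sqrt(1.0 - c)

def quadrature_moment(alpha, sigma2, n, steps=20001, width=12.0):
    """Riemann-sum evaluation of E exp{alpha Z^2}, Z ~ N(0, sigma2 / n)."""
    variance = sigma2 / n
    rate = 1.0 / (2.0 * variance) - alpha   # integrand is exp(-rate * z^2)
    if rate <= 0.0:
        raise ValueError("expectation is infinite for this alpha")
    half = width / math.sqrt(2.0 * rate)    # 'width' effective standard deviations
    h = 2.0 * half / (steps - 1)
    total = sum(math.exp(-rate * (-half + k * h) ** 2) for k in range(steps))
    return total * h / math.sqrt(2.0 * math.pi * variance)

def weighted_estimate(x, alpha, sigma2):
    """u^T W x / (u^T W u) with W = I - (2*alpha*sigma2/n^2) u u^T."""
    n = len(x)
    c = 2.0 * alpha * sigma2 / n ** 2
    total = sum(x)
    Wx = [xi - c * total for xi in x]   # W x = x - c * u * (u^T x)
    uWu = n - c * n * n                 # u^T W u
    return sum(Wx) / uWu

# Hypothetical values for illustration.
x = [0.3, -1.2, 2.5, 0.7, 1.1]
print(weighted_estimate(x, alpha=1.0, sigma2=2.0), sum(x) / len(x))
print(closed_form_moment(1.0, 2.0, 10), quadrature_moment(1.0, 2.0, 10))
```

Each printed pair should agree up to numerical precision.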



3.4 Example 4: The Gaussian Joint Source-Channel Coding Problem

Consider the Gaussian memoryless source

$$P_U(u) = (2\pi\sigma_u^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma_u^2}\sum_{i=1}^n u_i^2\right\} \tag{28}$$

and the Gaussian memoryless channel $y = x + z$, where the noise is distributed according to

$$P_Z(z) = (2\pi\sigma_z^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma_z^2}\sum_{i=1}^n z_i^2\right\}. \tag{29}$$

In the ordinary joint source-channel coding problem, one seeks an encoder and decoder that would
minimize $D = \frac{1}{n}\sum_{i=1}^n E\{(U_i-V_i)^2\}$, where $V = (V_1,\ldots,V_n)$ is the reconstruction at the decoder.
It is very well known that the best achievable distortion, in this case, is given by

$$D = \frac{\sigma_u^2}{1+\Gamma/\sigma_z^2}, \tag{30}$$

where $\Gamma$ is the maximum power allowed at the transmitter, and it may be achieved by a transmitter
that simply amplifies the source by a gain factor of $\sqrt{\Gamma/\sigma_u^2}$ and a receiver that implements linear
MMSE estimation of $U_i$ given $Y_i$, on a symbol-by-symbol basis.

What happens if we replace the criterion of expected distortion by the criterion of the
exponential moment of the distortion, $E\exp\{\alpha\sum_i (U_i-V_i)^2\}$? It is natural to wonder whether simple
linear transmitters and receivers, of the kind defined in the previous paragraph, are still optimum.

The random object $X$, in this example, is the pair of vectors $(U,Z)$, where $U$ is the source
vector and $Z$ is the channel noise vector, which under $P = P_U \times P_Z$ are independent Gaussian
i.i.d. random vectors with zero mean and variances $\sigma_u^2$ and $\sigma_z^2$, respectively, as said. Our strategy
$s$ consists of the choice of an encoding function $x = f(u)$ and a decoding function $v = g(y)$.
The class $\mathcal{S}$ is then the set of all pairs of functions $\{f,g\}$, where $f$ satisfies the power constraint
$E_P\{\|f(U)\|^2\} \le n\Gamma$. Condition 2 of Theorem 1 tells us that the modified probability distribution
of $u$ and $z$ should be of the form

$$Q(u,z) \propto P_U(u)P_Z(z)\exp\left\{\alpha\sum_{i=1}^n [u_i - g_i(f(u)+z)]^2\right\}, \tag{31}$$

where $g_i$ is the restriction of $g$ to the $i$-th component of $v$.



Clearly, if we continue to restrict the encoder $f$ to be linear, with a gain of $\sqrt{\Gamma/\sigma_u^2}$, which simply
exploits the allowed power $\Gamma$, and the only remaining room for optimization concerns the decoder $g$,
then we are basically back to the previous example of Bayesian estimation in the Gaussian regime,
and the optimum choice of the decoder is a linear one, exactly as in the traditional mean square
error case (by the same consideration as in the Bayesian estimation example). However, once
we extend the scope and allow $f$ to be a non-linear encoder, the optimum choice of $f$ and $g$
no longer remains linear as in the expected distortion case. It is not difficult to see that the
conditions of Theorem 1 are no longer met for any linear functions $f$ and $g$. The key reason is that
while $Q(u,z)$ of eq. (31) continues to be Gaussian (though now $U_i$ and $Z_i$ are correlated) when $f$
and $g$ are linear, the power constraint, $E_P\{\|X\|^2\} \le n\Gamma$, when expressed as an expectation w.r.t.
$Q$, becomes $E_Q\{\|f(U)\|^2 P(U)/Q(U)\} \le n\Gamma$, but the "power" function $\|f(u)\|^2 P(u)/Q(u)$, with $P$
and $Q$ being Gaussian densities, is no longer the usual quadratic function of $f(u)$ for which there
is a linear encoder and decoder that is optimum.

Another way to see that linear encoders and decoders are suboptimal is to consider the following
argument: For a given $n$, the expected exponentiated squared error is minimized by a joint
source-channel coding system, defined over a super-alphabet of $n$-tuples, with respect to a distortion
measure, defined in terms of a single super-letter, as

$$d(u,v) = \exp\left\{\alpha\sum_{i=1}^n (u_i-v_i)^2\right\}. \tag{32}$$

For such a joint source-channel coding system to be optimal, the induced channel $P(v|u)$ must [H,
p. 31, eq. (2.5.13)] be proportional to

$$P(v)\exp\{-\beta d(u,v)\} = P(v)\exp\left\{-\beta\exp\left[\alpha\sum_{i=1}^n (u_i-v_i)^2\right]\right\} \tag{33}$$

for some $\beta > 0$, which is the well-known structure of the optimum test channel that attains the
rate-distortion function for the Gaussian source and the above defined distortion measure. Had
the aforementioned linear system been optimum, the optimum output distribution $P(v)$ would
be Gaussian, and then $P(v|u)$ would remain proportional to a double exponential function of
$\sum_i (u_i-v_i)^2$. However, the linear system induces instead a Gaussian channel from $u$ to $v$, which
is very different, and therefore cannot be optimum.

Of course, the minimum of $E\exp\{\alpha\sum_i (U_i-V_i)^2\}$ can be approached by separate source and
channel coding, defined on blocks of super-letters formed by $n$-tuples. The source encoder is an
optimum rate-distortion code for the above defined 'single-letter' distortion measure, operating at
a rate close to the channel capacity, and the channel code is constructed accordingly to support
the same rate.

4 Universal Asymptotically Optimum Strategies 

The optimum strategy for minimizing $E_P\exp\{\alpha\ell(X,s)\}$ depends, in general, on both $P$ and $\alpha$. It
turns out, however, that this dependence on $P$ and $\alpha$ can sometimes be relaxed if one gives up the
ambition of deriving a strictly optimum strategy, and resorts to asymptotically optimum strategies.

Consider the case where, instead of one random variable X, we have a random vector X = 
{X\, . . . ,X n ), governed by an product form probability function 

n 

^)=lt^), (34) 

i=l 

where each component $x_i$ of the vector $x = (x_1, \ldots, x_n)$ takes on values in a finite set $\mathcal{X}$. If
$\ell(x,s)$ grows linearly with $n$ for a given empirical distribution of $x$ and a given $s \in \mathcal{S}$, then it is
expected that the exponential moment $E\exp\{\alpha \ell(X,s)\}$ would behave, at least asymptotically, as
an exponential function of $n$. In particular, for a given $s$, the limit

$$\lim_{n\to\infty} \frac{1}{n} \ln E_P \exp\{\alpha \ell(X,s)\}$$

exists. Let us denote this limit by $E(s,\alpha,P)$. An asymptotically optimum strategy is then a strategy
$s^*$ for which

$$E(s^*,\alpha,P) \le E(s,\alpha,P) \tag{35}$$

for every $s \in \mathcal{S}$. An asymptotically optimum strategy $s^*$ is called universal asymptotically optimum
w.r.t. a class $\mathcal{P}$ of probability distributions if $s^*$ is independent of $\alpha$ and $P$, yet it satisfies eq.
(35) for all $\alpha$ in the allowed range, every $s \in \mathcal{S}$, and every $P \in \mathcal{P}$. In this section, we take $\mathcal{P}$ to
be the class of all memoryless sources with a given finite alphabet $\mathcal{X}$. We denote by $T_Q$ the type
class pertaining to an empirical distribution $Q$, namely, the set of vectors $x \in \mathcal{X}^n$ whose empirical
distribution is $Q$.



This happens, for example, when $\ell$ is additive, i.e., $\ell(x,s) = \sum_{i=1}^{n} l(x_i, s)$.

Suppose there exists a strategy $s^*$ and a function $\lambda : \mathcal{P} \to \mathbb{R}$ such that the following two conditions
hold:

(a) For every type class $T_Q$ and every $x \in T_Q$, $\ell(x,s^*) \le n[\lambda(Q) + o(n)]$, where $o(n)$ designates
a (positive) sequence that tends to zero as $n \to \infty$.

(b) For every type class $T_Q$ and every $s \in \mathcal{S}$,

$$\Big| T_Q \cap \{x:\ \ell(x,s) \ge n[\lambda(Q) - o(n)]\} \Big| \ge e^{-n\,o(n)}\, |T_Q|. \tag{36}$$



It is then a straightforward exercise to show, using the method of types, that $s^*$ is a universal
asymptotically optimum strategy w.r.t. $\mathcal{P}$, with

$$E(s^*,\alpha,P) = \max_{Q}\,[\alpha\lambda(Q) - D(Q\|P)], \tag{37}$$

where condition (a) supports the direct part and condition (b) supports the converse part. The 
interesting point here then is not quite in the last statement, but in the fact that there are quite a 
few application examples where these two conditions hold at the same time. 
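As a small numerical illustration of the form of eq. (37), consider the simplest special case, which is our own choice and not one of the examples of this paper: a fixed strategy with an additive loss, so that $\lambda(Q)$ reduces to the expectation of a per-letter loss under $Q$. The exponent can then be computed directly, and it should coincide with the maximum over $Q$, which is attained by the tilted distribution proportional to $P(x)e^{\alpha l(x)}$. The alphabet, the per-letter loss values and all function names below are hypothetical.

```python
import numpy as np

def exponent_direct(P, loss, alpha):
    """(1/n) ln E_P exp{alpha * sum_i loss(X_i)} for an i.i.d. source (exact for every n)."""
    return float(np.log(np.sum(P * np.exp(alpha * loss))))

def objective(Q, P, loss, alpha):
    """alpha * E_Q[loss] - D(Q||P), i.e. the bracket of eq. (37) with lambda(Q) = E_Q[loss]."""
    return float(alpha * np.dot(Q, loss) - np.sum(Q * np.log(Q / P)))

def dominant_type(P, loss, alpha):
    """Tilted distribution Q* proportional to P(x) exp{alpha * loss(x)}."""
    w = P * np.exp(alpha * loss)
    return w / w.sum()

# Hypothetical ternary source and per-letter loss, chosen only for illustration.
P = np.array([0.5, 0.3, 0.2])
loss = np.array([1.0, 2.0, 4.0])
alpha = 0.7

Q_star = dominant_type(P, loss, alpha)
direct = exponent_direct(P, loss, alpha)
variational = objective(Q_star, P, loss, alpha)
print(direct, variational)
```

The two printed numbers should agree up to floating-point error, and any other distribution $Q$ plugged into `objective` should give a value that is no larger.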

Before we provide such examples, however, a few words are in order concerning conditions (a)
and (b). Condition (a) means that there is a choice of $s^*$ that does not depend on $x$ or on its
type class, yet the performance of $s^*$, for every $x \in T_Q$, "adapts" to the empirical distribution $Q$
of $x$ in a way that, according to condition (b), is "essentially optimum" (i.e., cannot be improved
significantly), at least for a considerable (non-exponential) fraction of the members of $T_Q$. It is
instructive to relate conditions (a) and (b) above to conditions 1 and 2 of Theorem 1. First, observe
that in order to guarantee asymptotic optimality of $s^*$, condition 2 of Theorem 1 can be somewhat
relaxed: For Jensen's inequality (used in the proof of Theorem 1) to remain exponentially tight, it is no longer necessary to
make the random variable $\alpha\ell(X,s) + \ln[P(X)/Q(X)]$ completely degenerate (i.e., a constant for
every realization $x$, as in condition 2 of Theorem 1), but it is enough to keep it essentially fixed
across a considerably large subset of the dominant type class, $T_{Q^*}$, i.e., the one whose empirical
distribution $Q^*$ essentially achieves the maximum of $[\alpha\lambda(Q) - D(Q\|P)]$. Taking $Q^*(x)$ to be
the memoryless source induced by the dominant $Q^*$, this is indeed precisely what happens under



As before, s* is chosen without observing the data first. 
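conditions (a) and (b), which imply that

$$\begin{aligned}
\alpha\ell(x,s^*) + \ln\frac{P(x)}{Q^*(x)} &\approx n\alpha\lambda(Q^*) + \sum_{i=1}^{n} \ln\frac{P(x_i)}{Q^*(x_i)} \\
&= n\alpha\lambda(Q^*) + n\sum_{x\in\mathcal{X}} Q^*(x)\ln\frac{P(x)}{Q^*(x)} \\
&= n[\alpha\lambda(Q^*) - D(Q^*\|P)],
\end{aligned} \tag{38}$$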



for (at least) a non-exponential fraction of the members of $T_{Q^*}$, namely, a subset of $T_{Q^*}$ that is large
enough to maintain the exponential order of the (dominant) contribution of $T_{Q^*}$ to $E\exp\{\alpha\ell(X,s^*)\}$.
Loosely speaking, the combination of conditions (a) and (b) also means then that $s^*$ is essentially
optimum for (this subset of) $T_{Q^*}$, which is reminiscent of condition 1 of Theorem 1. Moreover,
since $s^*$ "adapts" to every $T_Q$, in the sense explained above, this has the flavor of the
max-min problem discussed in Section 2, where $s$ is allowed to be optimized for each and every $Q$.
Since the minimizing $s$, in the max-min problem, is independent of $P$ and $\alpha$, this also explains the
universality property of such a strategy.

Let us now discuss a few examples. The first example is that of fixed-rate rate-distortion
coding. A vector $X$ that emerges from a memoryless source $P$ is to be encoded by a coding scheme
$s$ with respect to a given additive distortion measure, based on a single-letter distortion measure
$d : \mathcal{X} \times \hat{\mathcal{X}} \to \mathbb{R}$, $\hat{\mathcal{X}}$ being the reconstruction alphabet. Let $D_Q(R)$ denote the distortion-rate
function of a memoryless source $Q$ (with a finite alphabet $\mathcal{X}$) relative to the single-letter distortion
measure $d$ and let $\ell(x,s)$ designate the distortion between the source vector $x$ and its reproduction,
using a rate-distortion code $s$. It is not difficult to see that this example meets conditions (a) and
(b) with $\lambda(Q) = D_Q(R)$: Condition (a) is based on the type covering lemma Section 2.4],
according to which each type class $T_Q$ can be completely covered by essentially $e^{nR}$ 'spheres' of
radius $nD_Q(R)$ (in the sense of $d$), centered at the reproduction vectors. Thus $s^*$ can be chosen
to be a scheme that encodes $x$ in two parts, the first of which is a header that describes the index
of the type class $T_Q$ of $x$ (whose description length is proportional to $\log n$) and the second part
encodes the index of the codeword within $T_Q$, using $nR$ nats. Condition (b) is met since there is
no way to cover $T_Q$ with exponentially fewer than $e^{nR}$ spheres within distortion less than $D_Q(R)$.

By the same token, consider the dual problem of variable-rate coding within a maximum allowed
distortion $D$. In this case, every source vector $x$ is encoded by $\ell(x,s)$ nats, and this time, conditions
(a) and (b) apply with the choice $\lambda(Q) = R_Q(D)$, which is the rate-distortion function of $Q$ (the
inverse function of $D_Q(R)$). The considerations are similar to those of the first example. It is
interesting to particularize this example, of variable-rate coding, to the lossless case, $D = 0$ (thus
revisiting Example 1), where $R_Q(0) = H_Q$, the empirical entropy associated with $Q$. In this case,
a more refined result can be obtained, which extends a well-known result due to Rissanen [22] in
universal data compression: According to [22], given the length function $\ell(x,s)$ of a lossless data
compression scheme $s$, and given a parametric class of sources $\{P_\theta\}$, indexed
by a parameter $\theta \in \Theta \subset \mathbb{R}^k$, a lower bound on $E_\theta \ell(X,s)$ that applies to most values of $\theta$ is
given by

$$E_\theta \ell(X,s) \ge nH_\theta + (1-\epsilon)\frac{k}{2}\log n, \tag{39}$$

where $\epsilon > 0$ is arbitrarily small (for large $n$), $H_\theta$ is the entropy associated with $P_\theta$, and $E_\theta\{\cdot\}$ is
the expectation under $P_\theta$. On the other hand, the same expression is achievable, by a number of
universal coding schemes, provided that the factor $(1-\epsilon)$ in the above expression is replaced by
$(1+\epsilon)$. Consider now the case where $\{P_\theta, \theta \in \Theta\}$ is the class of all memoryless sources over $\mathcal{X}$,
where the parameter vector $\theta$ designates $k = |\mathcal{X}| - 1$ letter probabilities. As for a lower bound, we
have



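$$\begin{aligned}
\ln E_P \exp\{\alpha\ell(X,s)\} &\ge \max_{Q}\,[\alpha E_Q \ell(X,s) - nD(Q\|P)] \\
&\ge \max_{Q}\Big\{\alpha\Big[nH_Q + (1-\epsilon)\frac{k}{2}\ln n\Big] - nD(Q\|P)\Big\} \\
&= n\max_{Q}\,[\alpha H_Q - D(Q\|P)] + \alpha(1-\epsilon)\frac{k}{2}\ln n \\
&= n\alpha H_{1/(1+\alpha)}(P) + \alpha(1-\epsilon)\frac{k}{2}\ln n,
\end{aligned} \tag{40}$$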

where the second line follows from Rissanen's lower bound (for most sources), and where $H_u(P)$ is
Rényi's entropy of order $u$, namely,

$$H_u(P) = \frac{1}{1-u}\ln\Big[\sum_{x\in\mathcal{X}} P(x)^u\Big]. \tag{41}$$
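The last equality in eq. (40) rests on the identity $\max_Q[\alpha H_Q - D(Q\|P)] = \alpha H_{1/(1+\alpha)}(P)$, whose maximizer is the distribution proportional to $P^{1/(1+\alpha)}$. The following sketch checks this identity numerically; the source $P$, the value of $\alpha$ and the function names are arbitrary choices made here for illustration.

```python
import numpy as np

def renyi_entropy(P, u):
    """Renyi entropy of order u (in nats), as in eq. (41); requires u != 1."""
    return float(np.log(np.sum(P ** u)) / (1.0 - u))

def tilted(P, alpha):
    """Distribution proportional to P(x)^{1/(1+alpha)}, the claimed maximizer."""
    w = P ** (1.0 / (1.0 + alpha))
    return w / w.sum()

def objective(Q, P, alpha):
    """alpha * H(Q) - D(Q||P) for strictly positive Q and P."""
    entropy = -np.sum(Q * np.log(Q))
    divergence = np.sum(Q * np.log(Q / P))
    return float(alpha * entropy - divergence)

# Arbitrary illustrative source and parameter.
P = np.array([0.6, 0.25, 0.1, 0.05])
alpha = 1.5

at_maximizer = objective(tilted(P, alpha), P, alpha)
closed_form = alpha * renyi_entropy(P, 1.0 / (1.0 + alpha))
print(at_maximizer, closed_form)
```

Both printed values should coincide up to rounding, which is the single-letter content of the passage from the third to the fourth line of eq. (40).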

Consider now a two-part code $s^*$, which first encodes the index of the type class $T_Q$ and then the
index of $x$ within the type class. The corresponding length function is given by

$$\ell(x,s^*) = \ln|T_Q| + k\ln n \approx n\hat{H}(x) + \frac{k}{2}\ln n, \tag{42}$$



5 "Most values of $\theta$" means all values of $\theta$ with the possible exception of a subset of $\Theta$ whose Lebesgue measure
tends to zero as $n$ tends to infinity.
where $\hat{H}(x)$ is the empirical entropy pertaining to $x$, and where the approximate equality is easily
obtained by the Stirling approximation. Then,

$$\begin{aligned}
\ln E_P \exp\{\alpha\ell(X,s^*)\} &= \ln E_P \exp\{\alpha n\hat{H}(X)\} + \alpha\frac{k}{2}\ln n \\
&= \ln E_P \exp\{\alpha \min_{Q}[-\ln Q(X)]\} + \alpha\frac{k}{2}\ln n \\
&\le \min_{Q} \ln E_P \exp\{-\alpha\ln Q(X)\} + \alpha\frac{k}{2}\ln n \\
&= n\alpha H_{1/(1+\alpha)}(P) + \alpha\frac{k}{2}\ln n,
\end{aligned} \tag{43}$$

and so it essentially achieves the lower bound. Rissanen's result is now a special case of this,
corresponding to $\alpha \to 0$.
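The dominant term of eq. (43) can be checked numerically for a binary memoryless source, where the expectation of $\exp\{\alpha n\hat{H}(X)\}$ is an explicit sum over the $n+1$ type classes. The sketch below is our own illustration (the values of $n$, $p$, $\alpha$ and the function names are assumptions); it compares the normalized logarithm of this expectation with $\alpha H_{1/(1+\alpha)}(P)$, which by eq. (43) is an upper bound that should be approached as $n$ grows.

```python
import math

def binary_entropy(q):
    """Binary entropy in nats, with the convention 0 ln 0 = 0."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log(q) - (1.0 - q) * math.log(1.0 - q)

def renyi_entropy_binary(p, u):
    """Renyi entropy of order u (nats) of the binary source (p, 1-p)."""
    return math.log(p ** u + (1.0 - p) ** u) / (1.0 - u)

def log_moment_empirical_entropy(n, p, alpha):
    """ln E_P exp{alpha * n * H_hat(X)}, computed exactly by summing over types."""
    terms = []
    for k in range(n + 1):
        log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        log_prob = k * math.log(p) + (n - k) * math.log(1.0 - p)
        terms.append(log_binom + log_prob + alpha * n * binary_entropy(k / n))
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

n, p, alpha = 2000, 0.2, 1.0
per_symbol = log_moment_empirical_entropy(n, p, alpha) / n
target = alpha * renyi_entropy_binary(p, 1.0 / (1.0 + alpha))
print(per_symbol, target)
```

The first printed number should not exceed the second, and the gap between them should shrink as $n$ is increased.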

Our last example corresponds to a secrecy system. A sequence $x$ is to be communicated to a
legitimate decoder which shares with the transmitter a random key $z$ of $nR$ purely random bits.
The encoder transmits an encrypted message $y = \phi(x,z)$, which is an invertible function of $x$ given
$z$, and hence decipherable by the legitimate decoder. An eavesdropper, who has no access to the
key $z$, submits a sequence of guesses concerning $x$ until it receives an indication that the last guess
was correct (e.g., a correct guess of a password admits the eavesdropper into a secret system). For
the best possible encryption function $\phi$, what would be the optimum guessing strategy $s^*$ that the
eavesdropper may apply in order to minimize the $\alpha$-th moment of the number of guesses $G(X,s)$,
i.e., $E\{G^\alpha(X,s)\}$? In this case, $\ell(x,s) = \ln G(x,s)$. As is shown in [17], there exists a guessing
strategy $s^*$ which, for every $x \in T_Q$, gives $\ell(x,s^*) \approx n\min\{H_Q, R\}$, a quantity that essentially
cannot be improved upon by any other guessing strategy, for most members of $T_Q$. In other words,
conditions (a) and (b) apply with $\lambda(Q) = \min\{H_Q, R\}$.

5 Phase Transitions 

Another interesting aspect of the asymptotic behavior of the exponential moment is the possible
appearance of phase transitions, i.e., irregularities in the exponent function $E(s,\alpha,P)$ even in some
very simple and 'innocent' models. By irregularities, we mean a non-smooth behavior, namely,
discontinuities in the derivatives of $E(s,\alpha,P)$ with respect to $\alpha$ and/or the parameters of the
source $P$.






One example that exhibits phase transitions is that of the secrecy system, mentioned in the
last paragraph of the previous section. As is shown in [17], the optimum exponent $E(s^*,\alpha,P)$ for
this case exhibits two phase transitions as a function of $R$ (namely, three different phases). In
particular,

$$E(s^*,\alpha,P) = \begin{cases}
\alpha R & R < H(P) \\
(\alpha - \theta_R)R + \theta_R H_{1/(1+\theta_R)}(P) & H(P) \le R \le H(P_\alpha) \\
\alpha H_{1/(1+\alpha)}(P) & R > H(P_\alpha)
\end{cases} \tag{44}$$

where $P_\alpha$ is the distribution defined by

$$P_\alpha(x) = \frac{P^{1/(1+\alpha)}(x)}{\sum_{x'\in\mathcal{X}} P^{1/(1+\alpha)}(x')}, \tag{45}$$

$H(Q)$ is the Shannon entropy associated with a distribution $Q$, $H_u(Q)$ is the Rényi entropy of
order $u$ as defined before, and $\theta_R$ is the unique solution of the equation $R = H(P_\theta)$ for $R$ in the
range $H(P) \le R \le H(P_\alpha)$. But this example may not really be extremely surprising due to the
non-smoothness of the function $\lambda(Q) = \min\{H_Q, R\}$.
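To make the three phases of eq. (44) concrete, the following sketch evaluates the formula for a binary source and compares it with a brute-force maximization of $\alpha\min\{H_Q, R\} - D(Q\|P)$ over a fine grid of binary distributions $Q$, as suggested by eq. (37) with $\lambda(Q) = \min\{H_Q, R\}$. The source parameter, the rates and all function names are illustrative assumptions; $\theta_R$ is found by bisection, and the quantity $\theta H_{1/(1+\theta)}(P)$ is written as $(1+\theta)\ln\sum_x P(x)^{1/(1+\theta)}$ to avoid dividing by zero at $\theta = 0$.

```python
import math

def h(q):
    """Binary entropy in nats, for 0 < q < 1."""
    return -q * math.log(q) - (1 - q) * math.log(1 - q)

def kl(q, p):
    """Divergence D(Q||P) between binary distributions (q, 1-q) and (p, 1-p)."""
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def tilted_entropy(p, theta):
    """H(P_theta), where P_theta is proportional to P^{1/(1+theta)}."""
    a = p ** (1 / (1 + theta))
    b = (1 - p) ** (1 / (1 + theta))
    return h(a / (a + b))

def scaled_renyi(p, theta):
    """theta * H_{1/(1+theta)}(P), in a form that is harmless at theta = 0."""
    e = 1 / (1 + theta)
    return (1 + theta) * math.log(p ** e + (1 - p) ** e)

def exponent_formula(p, alpha, R):
    """Three-phase expression of eq. (44) for a binary source with p != 1/2."""
    if R < h(p):
        return alpha * R
    if R > tilted_entropy(p, alpha):
        return scaled_renyi(p, alpha)
    lo, hi = 0.0, alpha  # H(P_theta) increases with theta, so bisection applies
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if tilted_entropy(p, mid) < R:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    return (alpha - theta) * R + scaled_renyi(p, theta)

def exponent_brute_force(p, alpha, R, grid=200001):
    """max over a grid of Q of alpha * min{H(Q), R} - D(Q||P)."""
    best = -math.inf
    for i in range(1, grid):
        q = i / grid
        best = max(best, alpha * min(h(q), R) - kl(q, p))
    return best

p, alpha = 0.2, 1.0
for R in (0.3, 0.57, 0.68):  # one rate in each phase (all below ln 2)
    print(R, exponent_formula(p, alpha, R), exponent_brute_force(p, alpha, R))
```

For each of the three rates, the two printed exponents should agree up to the grid resolution.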

It may be somewhat less expected, however, to witness phase transitions also in some very
simple and 'innocent' looking models. One way to understand the phase transitions in these cases
comes from the statistical-mechanical perspective. It turns out that in some cases, the expression
of the exponential moment is analogous to that of a partition function of a certain many-particle
physical system with interactions, which may exhibit phase transitions, and these phase transitions
correspond to the above-mentioned irregularities.

We now demonstrate a very simple model that has phase transitions. Consider the case where
$X$ is a binary vector whose components take on values in $\mathcal{X} = \{-1, +1\}$, and which is governed by
a binary memoryless source $P_\mu$ with probabilities $\Pr\{X_i = +1\} = 1 - \Pr\{X_i = -1\} = (1+\mu)/2$
($\mu$ designating the expected 'magnetization' of each binary spin $X_i$, to make the physical analogy
apparent). The probability of $x$ under $P_\mu$ is thus easily shown to be given by

$$P_\mu(x) = \Big(\frac{1-\mu^2}{4}\Big)^{n/2} \exp\Big\{\frac{1}{2}\ln\frac{1+\mu}{1-\mu}\cdot\sum_{i=1}^{n} x_i\Big\}. \tag{46}$$

Consider the estimation of the parameter $\mu$ by the ML estimator

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i. \tag{47}$$
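How does the exponential moment $E_\mu \exp\{\alpha n(\hat{\mu} - \mu)^2\}$ behave? A straightforward derivation
yields

$$E_\mu \exp\{\alpha n(\hat{\mu}-\mu)^2\} = \Big(\frac{1-\mu^2}{4}\Big)^{n/2} e^{\alpha n\mu^2} \sum_{x} \exp\Big\{\Big[\frac{1}{2}\ln\frac{1+\mu}{1-\mu} - 2\alpha\mu\Big]\sum_{i=1}^{n} x_i + \frac{\alpha}{n}\Big(\sum_{i=1}^{n} x_i\Big)^2\Big\}.$$

The last summation over $\{x\}$ is exactly the partition function pertaining to the Curie-Weiss model
of spin arrays in statistical mechanics (see, e.g., [21, Subsection 2.5.2]), where the magnetic field is
given by

$$B = \frac{1}{2}\ln\frac{1+\mu}{1-\mu} - 2\alpha\mu \tag{48}$$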

and the coupling coefficient for every pair of spins is $J = 2\alpha$. It is well known that this model
exhibits phase transitions pertaining to spontaneous magnetization below a certain critical temperature.
In particular, using the method of types [7], this partition function can be asymptotically
evaluated as being of the exponential order of

$$\exp\Big\{n\cdot\max_{|m|\le 1}\Big[h_2\Big(\frac{1+m}{2}\Big) + Bm + \frac{J}{2}m^2\Big]\Big\},$$

where $h_2(\cdot)$ is the binary entropy function, which stands for the exponential order of the number
of configurations $\{x\}$ with a given value of $m = \frac{1}{n}\sum_{i} x_i$. This expression is clearly dominated
by a value of $m$ (the dominant magnetization $m^*$) which maximizes the expression in the square
brackets, i.e., it solves the equation

$$m = \tanh(Jm + B), \tag{49}$$

or in our variables,

$$m = \tanh\Big(2\alpha m + \frac{1}{2}\ln\frac{1+\mu}{1-\mu} - 2\alpha\mu\Big). \tag{50}$$
For $\alpha < 1/2$, there is only one solution and there is no spontaneous magnetization (paramagnetic
phase). For $\alpha > 1/2$, however, there are three solutions, and only one of them dominates the
partition function, depending on the sign of $B$, or equivalently, on whether $\alpha > \alpha_0(\mu) = \frac{1}{4\mu}\ln\frac{1+\mu}{1-\mu}$
or $\alpha < \alpha_0(\mu)$, and according to the sign of $\mu$. Accordingly, there are five different phases in the plane
spanned by $\alpha$ and $\mu$: the paramagnetic phase $\alpha < 1/2$; the phases $\{\mu > 0,\ 1/2 < \alpha < \alpha_0(\mu)\}$ and
$\{\mu < 0,\ \alpha > \alpha_0(\mu)\}$, where the dominant magnetization $m$ is positive; and the two complementary
phases, $\{\mu < 0,\ 1/2 < \alpha < \alpha_0(\mu)\}$ and $\{\mu > 0,\ \alpha > \alpha_0(\mu)\}$, where the dominant magnetization
is negative. Thus, there is a multi-critical point where the boundaries of all five phases meet, which
is the point $(\mu, \alpha) = (0, 1/2)$. The phase diagram is depicted in Fig. 1.

Figure 1: Phase diagram in the plane of $(\mu, \alpha)$.
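The dominant magnetization of eqs. (49)-(50) is easy to locate numerically by maximizing $h_2((1+m)/2) + Bm + (J/2)m^2$ over a fine grid of $m$. The sketch below is only an illustration under our own choice of grid size, parameter values and function names. It also evaluates the resulting exponent of $E_\mu\exp\{\alpha n(\hat{\mu}-\mu)^2\}$, obtained by adding the per-letter logarithm of the prefactor $[(1-\mu^2)/4]^{n/2}e^{\alpha n\mu^2}$ to the maximum.

```python
import numpy as np

def field_and_coupling(alpha, mu):
    """Magnetic field B of eq. (48) and coupling coefficient J = 2*alpha."""
    B = 0.5 * np.log((1 + mu) / (1 - mu)) - 2 * alpha * mu
    return float(B), 2.0 * alpha

def dominant_magnetization(alpha, mu, grid_size=200001):
    """Grid maximizer m* of h2((1+m)/2) + B*m + (J/2)*m^2 and the maximum value."""
    B, J = field_and_coupling(alpha, mu)
    m = np.linspace(-1 + 1e-9, 1 - 1e-9, grid_size)
    q = (1 + m) / 2
    h2 = -q * np.log(q) - (1 - q) * np.log(1 - q)  # binary entropy in nats
    f = h2 + B * m + 0.5 * J * m ** 2
    i = int(np.argmax(f))
    return float(m[i]), float(f[i])

def moment_exponent(alpha, mu):
    """Predicted limit of (1/n) ln E_mu exp{alpha * n * (mu_hat - mu)^2}."""
    _, f_max = dominant_magnetization(alpha, mu)
    return float(0.5 * np.log((1 - mu ** 2) / 4) + alpha * mu ** 2 + f_max)

# One point in the paramagnetic phase and one in each type of ordered phase (mu > 0).
for alpha, mu in [(0.3, 0.0), (0.51, 0.3), (1.0, 0.3)]:
    m_star, _ = dominant_magnetization(alpha, mu)
    B, J = field_and_coupling(alpha, mu)
    print(alpha, mu, m_star, m_star - np.tanh(J * m_star + B), moment_exponent(alpha, mu))
```

Note that $m = \mu$ always solves eq. (50); in the runs above, the maximizer should equal $\mu$ as long as $\alpha < \alpha_0(\mu)$ (where the exponent vanishes) and jump to a negative value once $\alpha$ exceeds $\alpha_0(\mu)$, which for $\mu = 0.3$ is roughly $0.516$.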

Yet another example of phase transitions is that of fixed-rate lossy data compression, discussed
in the previous section. To demonstrate this explicitly, consider the binary symmetric source
(BSS) and the Hamming distortion measure $d$, and consider a random selection of a rate-$R$ code
by $ne^{nR}$ independent fair coin tosses, one for each of the $n$ components of every one of the $e^{nR}$
codewords. It was shown in [16] that the asymptotic exponent of the negative exponential moment,
$E\exp\{-\alpha\sum_i d(U_i, V_i)\}$ (where the expectation is w.r.t. both the source and the random code
selection), is given by the following expression, which obviously exhibits a (second order) phase
transition:

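$$\lim_{n\to\infty}\frac{1}{n}\ln E\exp\Big\{-\alpha\sum_{i=1}^{n} d(U_i,V_i)\Big\} = \begin{cases}
-\alpha\delta(R) & \alpha \le \alpha(R) \\
-\alpha + \ln(1+e^{\alpha}) + R - \ln 2 & \alpha > \alpha(R)
\end{cases} \tag{51}$$

where $\delta(R)$ is the distortion-rate function of the BSS w.r.t. the Hamming distortion measure and

$$\alpha(R) = \ln\frac{1-\delta(R)}{\delta(R)}. \tag{52}$$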

The analysis in [16] is based on the random energy model (REM) [8], [9], [10], a well-known statistical-
mechanical model of spin glasses with strong disorder, which is known to exhibit phase transitions.
Moreover, it is shown in [IB] that ensembles of codes that have a hierarchical structure may have
more than one phase transition.
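The second-order nature of the transition in eq. (51) can be verified numerically: at $\alpha = \alpha(R)$ the two branches and their first derivatives should coincide, while the second derivative jumps. The sketch below (the rate, the step size and the function names are our own illustrative choices) computes $\delta(R)$ by bisection from $R = \ln 2 - h_2(\delta)$, with rates in nats, and evaluates both branches around the critical point.

```python
import math

LN2 = math.log(2.0)

def h2(x):
    """Binary entropy in nats, with the convention 0 ln 0 = 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log(x) - (1.0 - x) * math.log(1.0 - x)

def distortion_rate_bss(R):
    """delta(R) for the BSS with Hamming distortion: solves R = ln 2 - h2(delta), 0 < R < ln 2."""
    lo, hi = 0.0, 0.5
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if LN2 - h2(mid) > R:  # rate still too high at this distortion: increase it
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def critical_alpha(R):
    """alpha(R) of eq. (52)."""
    d = distortion_rate_bss(R)
    return math.log((1.0 - d) / d)

def low_branch(alpha, R):
    """First line of eq. (51)."""
    return -alpha * distortion_rate_bss(R)

def high_branch(alpha, R):
    """Second line of eq. (51)."""
    return -alpha + math.log(1.0 + math.exp(alpha)) + R - LN2

R = 0.3
a_c = critical_alpha(R)
step = 1e-6
slope_low = -distortion_rate_bss(R)
slope_high = (high_branch(a_c + step, R) - high_branch(a_c, R)) / step
print(a_c, low_branch(a_c, R), high_branch(a_c, R), slope_low, slope_high)
```

The values and the slopes printed for the two branches should match at $\alpha(R)$, whereas the second derivative is zero on the first branch and $e^{\alpha}/(1+e^{\alpha})^2 > 0$ on the second.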

6 Lower Bounds on Exponential Moments 

As explained in the Introduction, even in the ordinary setting of the quest for minimizing $E\{\ell(X,s)\}$,
optimum strategies may not always be known, and then useful lower bounds are very important.
This is definitely the case when exponential moments are considered, because the exponential moment
criterion is even harder to handle. To obtain non-trivial bounds on exponential moments, we
propose to harness lower bounds on the expectation of $\ell(X,s)$, possibly using a change of measure,
in the spirit of the proof of Theorem 1 and the previous example of a lower bound on universal
lossless data compression. We next demonstrate this idea in the context of a lower bound on the
expected exponentiated squared error of an unbiased estimator, on the basis of the Cramér-Rao
bound (CRB). The basic idea, however, is applicable more generally, e.g., by relying on other well-known
Bayesian/non-Bayesian bounds on the mean-square error (e.g., the Weiss-Weinstein bound
for Bayesian estimation [25]), as well as in bounds on signal estimation (filtering, prediction, etc.),
and in other problem areas as well. Further investigation along this line may be of considerable interest.

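Consider a parametric family of probability distributions $\{P_\theta, \theta \in \Theta\}$, $\Theta \subset \mathbb{R}$ being the
parameter set, and suppose that we are interested in a lower bound on $E_\theta \exp\{\alpha(\hat{\theta} - \theta)^2\}$, for any
unbiased estimator $\hat{\theta}$ of $\theta$, where as before, $E_\theta$ denotes expectation w.r.t. $P_\theta$. Consider the following
chain of inequalities, which holds for any $\theta' \in \Theta$:

$$\begin{aligned}
\ln E_\theta \exp\{\alpha(\hat{\theta}-\theta)^2\} &\ge \alpha E_{\theta'}(\hat{\theta}-\theta)^2 - D(P_{\theta'}\|P_\theta) \\
&= \alpha E_{\theta'}(\hat{\theta}-\theta')^2 + \alpha(\theta'-\theta)^2 - D(P_{\theta'}\|P_\theta) \\
&\ge \alpha\,\mathrm{CRB}(\theta') + \alpha(\theta'-\theta)^2 - D(P_{\theta'}\|P_\theta),
\end{aligned} \tag{53}$$

where $\mathrm{CRB}(\theta)$ is the Cramér-Rao bound for unbiased estimators, computed at $\theta$ (i.e., $\mathrm{CRB}(\theta) =
1/I(\theta)$, where $I(\theta)$ is the Fisher information). Since this lower bound applies for every $\theta' \in \Theta$, one
can take its supremum over $\theta' \in \Theta$ and obtain

$$\ln E_\theta \exp\{\alpha(\hat{\theta}-\theta)^2\} \ge \sup_{\theta'\in\Theta}\,[\alpha\,\mathrm{CRB}(\theta') + \alpha(\theta'-\theta)^2 - D(P_{\theta'}\|P_\theta)]. \tag{54}$$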

More generally, if $\theta = (\theta_1, \ldots, \theta_k)^T$ is a parameter vector (thus $\theta \in \Theta \subset \mathbb{R}^k$) and $a \in \mathbb{R}^k$ is an
arbitrary deterministic (column) vector, then

$$\ln E_\theta \exp\{a^T(\hat{\theta}-\theta)(\hat{\theta}-\theta)^T a\} \ge \sup_{\theta'\in\Theta}\Big\{a^T I^{-1}(\theta')a + [a^T(\theta'-\theta)]^2 - D(P_{\theta'}\|P_\theta)\Big\}, \tag{55}$$

where here $I(\theta)$ is the Fisher information matrix and $I^{-1}(\theta)$ is its inverse.

It would be interesting to further investigate bounds of this type, in parameter estimation in 
particular, and in other problem areas in general, and to examine when these bounds may be tight 
and useful. 
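As a simple sanity check of eq. (54), consider an example that is not taken from this paper: $n$ i.i.d. Gaussian observations with unknown mean $\theta$ and known variance $\sigma^2$. Here $\mathrm{CRB}(\theta') = \sigma^2/n$ and $D(P_{\theta'}\|P_\theta) = n(\theta'-\theta)^2/(2\sigma^2)$, so the supremum in eq. (54) equals $\alpha\sigma^2/n$ when $\alpha \le n/(2\sigma^2)$ and is infinite otherwise. For the sample mean, which is unbiased, the exact value is $-\frac{1}{2}\ln(1 - 2\alpha\sigma^2/n)$ for $\alpha < n/(2\sigma^2)$ and infinite beyond. The sketch below, with hypothetical function names and parameter values, compares the two.

```python
import math

def crb_lower_bound(alpha, sigma2, n):
    """Right-hand side of eq. (54) for the Gaussian location family (our worked example).

    The bracket is alpha*sigma2/n + (alpha - n/(2*sigma2)) * (theta' - theta)^2,
    so the supremum is at theta' = theta unless the quadratic coefficient is positive.
    """
    if alpha > n / (2.0 * sigma2):
        return math.inf
    return alpha * sigma2 / n

def exact_log_moment_sample_mean(alpha, sigma2, n):
    """ln E exp{alpha * (theta_hat - theta)^2} when theta_hat - theta ~ N(0, sigma2/n)."""
    x = 2.0 * alpha * sigma2 / n
    if x >= 1.0:
        return math.inf
    return -0.5 * math.log(1.0 - x)

sigma2, n = 2.0, 10
for alpha in (0.1, 1.0, 2.0, 2.4, 3.0):
    print(alpha, crb_lower_bound(alpha, sigma2, n), exact_log_moment_sample_mean(alpha, sigma2, n))
```

In this example the bound is tight to first order in $\alpha$ and becomes infinite beyond the same threshold, $\alpha = n/(2\sigma^2)$, at which the exact exponential moment of the sample mean diverges.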